Descriptive Analytics#
Package imports#
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.tree import DecisionTreeClassifier
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit
from scipy import stats
import psycopg2
Data loading#
We begin the exploratory data analysis by loading the data:
conn = psycopg2.connect(
dbname="neondb",
user="neondb_owner",
password="YexCUXob4Oy7",
host="ep-little-feather-a5rorog4.us-east-2.aws.neon.tech",
port="5432"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM companies")
# Fetch the rows and store them in a DataFrame
data2 = cursor.fetchall()
data = pd.DataFrame(data2)
data.columns=['company', 'sector', 'horizon_days', 'amount', 'date_BUY_fix',
'date_SELL_fix', 'price_BUY', 'price_SELL', 'Volatility_Buy',
'Volatility_sell', 'Sharpe_Ratio', 'expected_return_yearly',
'inflation', 'nominal_return', 'investment', 'ESG_ranking', 'PE_ratio',
'EPS_ratio', 'PS_ratio', 'PB_ratio', 'NetProfitMargin_ratio',
'current_ratio', 'roa_ratio', 'roe_ratio']
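As an aside, the fetch-then-wrap pattern above can be collapsed into a single call with `pandas.read_sql`, which also names the columns from the table schema. A minimal sketch, using an in-memory SQLite table as a stand-in for the Postgres `companies` table (the two demo columns are hypothetical):

```python
import sqlite3
import pandas as pd

# In-memory SQLite table standing in for the Postgres `companies` table
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE companies (company TEXT, sector TEXT)")
demo.execute("INSERT INTO companies VALUES ('HMC', 'AUTO'), ('MSFT', 'TECH')")

# pandas runs the query and takes column names from the schema
df = pd.read_sql("SELECT * FROM companies", demo)
```

The same `pd.read_sql("SELECT * FROM companies", conn)` call would work against the psycopg2 connection created above, removing the need to assign `data.columns` by hand.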
Let's take a look at the data:
data.head()
| company | sector | horizon_days | amount | date_BUY_fix | date_SELL_fix | price_BUY | price_SELL | Volatility_Buy | Volatility_sell | ... | investment | ESG_ranking | PE_ratio | EPS_ratio | PS_ratio | PB_ratio | NetProfitMargin_ratio | current_ratio | roa_ratio | roe_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HMC | AUTO | 7 | 500.0 | 2016-06-14 | 2016-06-21 | 25.600000 | 26.410000 | 0.274761 | 0.274900 | ... | GOOD | 28.5 | 13.50 | 1.64 | 0.32 | 0.00 | 2.40 | 1.11 | 1.93 | 4.87 |
| 1 | MSFT | TECH | 330 | 1500.0 | 2016-07-22 | 2017-06-16 | 52.356224 | 66.406387 | 0.279043 | 0.160044 | ... | GOOD | 14.7 | 23.18 | 2.05 | 4.19 | 5.15 | 22.53 | 2.35 | 11.28 | 27.29 |
| 2 | BAC | BANK | 15 | 25000.0 | 2015-01-28 | 2015-02-12 | 13.609470 | 14.925651 | 0.228531 | 0.232711 | ... | GOOD | 26.3 | 42.16 | 0.38 | 1.77 | 0.69 | 4.62 | 0.94 | 0.21 | 1.88 |
| 3 | KSS | RETAIL | 270 | 50000.0 | 2015-01-28 | 2015-10-26 | 47.355690 | 36.093388 | 0.221473 | 0.292144 | ... | BAD | 12.9 | 10.87 | 4.26 | 0.50 | 1.55 | 4.56 | 1.95 | 6.03 | 14.78 |
| 4 | AAPL | TECH | 600 | 10000.0 | 2015-05-21 | 2017-01-10 | 30.166021 | 28.201811 | 0.217025 | 0.227415 | ... | BAD | 16.5 | 14.06 | 2.02 | 3.19 | 5.08 | 22.53 | 1.16 | 19.56 | 39.44 |
5 rows × 24 columns
Through the describe() and info() functions we get a preliminary view of the types of data in the DataFrame.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company 900 non-null object
1 sector 900 non-null object
2 horizon_days 900 non-null int64
3 amount 900 non-null float64
4 date_BUY_fix 900 non-null object
5 date_SELL_fix 900 non-null object
6 price_BUY 900 non-null float64
7 price_SELL 900 non-null float64
8 Volatility_Buy 900 non-null float64
9 Volatility_sell 900 non-null float64
10 Sharpe_Ratio 900 non-null float64
11 expected_return_yearly 900 non-null float64
12 inflation 900 non-null float64
13 nominal_return 900 non-null float64
14 investment 900 non-null object
15 ESG_ranking 900 non-null float64
16 PE_ratio 900 non-null float64
17 EPS_ratio 900 non-null float64
18 PS_ratio 900 non-null float64
19 PB_ratio 900 non-null float64
20 NetProfitMargin_ratio 900 non-null float64
21 current_ratio 900 non-null float64
22 roa_ratio 900 non-null float64
23 roe_ratio 900 non-null float64
dtypes: float64(18), int64(1), object(5)
memory usage: 168.9+ KB
Data manipulation#
We can see that the columns “company”, “date_BUY_fix” and “date_SELL_fix” add no value to the predictive process, since they carry information specific to each individual row. We therefore drop them.
clean = ["company", "date_BUY_fix", "date_SELL_fix"]
data_1 = data.drop(clean, axis=1)
data_1.head()
| sector | horizon_days | amount | price_BUY | price_SELL | Volatility_Buy | Volatility_sell | Sharpe_Ratio | expected_return_yearly | inflation | ... | investment | ESG_ranking | PE_ratio | EPS_ratio | PS_ratio | PB_ratio | NetProfitMargin_ratio | current_ratio | roa_ratio | roe_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AUTO | 7 | 500.0 | 25.600000 | 26.410000 | 0.274761 | 0.274900 | 0.274761 | -0.000176 | -0.2 | ... | GOOD | 28.5 | 13.50 | 1.64 | 0.32 | 0.00 | 2.40 | 1.11 | 1.93 | 4.87 |
| 1 | TECH | 330 | 1500.0 | 52.356224 | 66.406387 | 0.279043 | 0.160044 | 0.279043 | 0.197311 | -0.2 | ... | GOOD | 14.7 | 23.18 | 2.05 | 4.19 | 5.15 | 22.53 | 2.35 | 11.28 | 27.29 |
| 2 | BANK | 15 | 25000.0 | 13.609470 | 14.925651 | 0.228531 | 0.232711 | 0.228531 | 0.007771 | -0.5 | ... | GOOD | 26.3 | 42.16 | 0.38 | 1.77 | 0.69 | 4.62 | 0.94 | 0.21 | 1.88 |
| 3 | RETAIL | 270 | 50000.0 | 47.355690 | 36.093388 | 0.221473 | 0.292144 | 0.221473 | 0.005071 | -0.5 | ... | BAD | 12.9 | 10.87 | 4.26 | 0.50 | 1.55 | 4.56 | 1.95 | 6.03 | 14.78 |
| 4 | TECH | 600 | 10000.0 | 30.166021 | 28.201811 | 0.217025 | 0.227415 | 0.217025 | 0.403993 | -0.5 | ... | BAD | 16.5 | 14.06 | 2.02 | 3.19 | 5.08 | 22.53 | 1.16 | 19.56 | 39.44 |
5 rows × 21 columns
Pie Plot#
conteo_categorias = data_1['investment'].value_counts()
# Build a DataFrame from the category counts (separate keys for label and count)
df_conteo = pd.DataFrame({'investment': conteo_categorias.index, 'count': conteo_categorias.values})
# Create the pie chart with Plotly
fig = px.pie(df_conteo, values='count', names='investment',
             title='Distribution of the investment variable')
# Show the chart
fig.show()
Histogram#
fig = px.histogram(data_1, x='investment', color='investment',
                   labels={'investment': 'investment', 'count': 'Frequency'},
                   title='Distribution of the investment variable')
# Update the chart layout
fig.update_layout(xaxis_title='Categorical variable',
                  yaxis_title='Frequency')
# Show the chart
fig.show()
# Split the data by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["inflation"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["inflation"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of inflation',
                   xaxis=dict(title='inflation'),
                   yaxis=dict(title='Frequency'),
                   barmode='overlay')
# Build the figure and add both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split the data by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["nominal_return"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["nominal_return"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of nominal_return',
                   xaxis=dict(title='nominal_return'),
                   yaxis=dict(title='Frequency'),
                   barmode='overlay')
# Build the figure and add both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split the data by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["amount"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["amount"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of amount',
                   xaxis=dict(title='amount'),
                   yaxis=dict(title='Frequency'),
                   barmode='overlay')
# Build the figure and add both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split the data by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["Sharpe_Ratio"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["Sharpe_Ratio"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of Sharpe_Ratio',
                   xaxis=dict(title='Sharpe_Ratio'),
                   yaxis=dict(title='Frequency'),
                   barmode='overlay')
# Build the figure and add both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
# Split the data by investment category
df_desl_1 = data_1[data_1['investment'] == 'GOOD']
df_desl_0 = data_1[data_1['investment'] == 'BAD']
# Histogram for investment = GOOD
histogram_1 = go.Histogram(x=df_desl_1["roa_ratio"], opacity=0.5, name='investment = GOOD')
# Histogram for investment = BAD
histogram_0 = go.Histogram(x=df_desl_0["roa_ratio"], opacity=0.5, name='investment = BAD')
# Configure the chart layout
layout = go.Layout(title='Histogram of roa_ratio',
                   xaxis=dict(title='roa_ratio'),
                   yaxis=dict(title='Frequency'),
                   barmode='overlay')
# Build the figure and add both histograms
fig = go.Figure(data=[histogram_1, histogram_0], layout=layout)
# Show the chart
fig.show()
We can see how the variables are distributed and how this relates to the investment category.
Scatter plot#
# Create the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='nominal_return', color="investment",
                 title='Scatter plot',
                 labels={'inflation': 'inflation', 'nominal_return': 'nominal_return'})
# Show the chart
fig.show()
# Create the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='amount', color="investment",
                 title='Scatter plot',
                 labels={'inflation': 'inflation', 'amount': 'amount'})
# Show the chart
fig.show()
# Create the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='Sharpe_Ratio', color="investment",
                 title='Scatter plot',
                 labels={'inflation': 'inflation', 'Sharpe_Ratio': 'Sharpe_Ratio'})
# Show the chart
fig.show()
# Create the scatter plot with Plotly
fig = px.scatter(data_1, x='inflation', y='expected_return_yearly', color="investment",
                 title='Scatter plot',
                 labels={'inflation': 'inflation', 'expected_return_yearly': 'expected_return_yearly'})
# Show the chart
fig.show()
fig = px.scatter_3d(data_1, x='inflation', y='nominal_return', z='expected_return_yearly', color='investment',
                    title='3D scatter plot',
                    labels={'inflation': 'inflation', 'nominal_return': 'nominal_return', 'expected_return_yearly': 'expected_return_yearly', 'investment': 'investment'})
# Show the chart
fig.show()
Comparison with the response variable#
We can see how the variables behave depending on their investment category; “inflation” and “nominal_return” show the most distinctive behavior.
for column in data_1.columns:
fig = px.box(data_1, x= "investment", y=column, color = "investment")
fig.show()
As was evident in the classification tree, the variables with the greatest influence were inflation and nominal_return.
sector, horizon_days, amount, price_SELL, price_BUY, Volatility_Buy, Volatility_sell, Sharpe_Ratio, expected_return_yearly, ESG_ranking, PE_ratio, EPS_ratio, PS_ratio, NetProfitMargin_ratio, current_ratio, roa_ratio and roe_ratio do not show an apparent strong tendency toward either category. That is, visually they do not appear decisive in classifying an investment as GOOD or BAD.
fig = px.histogram(data, x='sector', color='investment')
fig.show()
We can see that no sector shows a dominant tendency toward good or bad investments. Since it is the only categorical predictor, we decide to exclude it from model training.
data_1 = data_1.drop("sector", axis=1)
Correlations#
We use the correlation matrix to check that the independent variables are not strongly linearly related to one another.
numeric_df = data_1.select_dtypes(include=['int', 'float'])
# Compute the correlation matrix
matriz_correlacion = numeric_df.corr()
matriz_correlacion
| horizon_days | amount | price_BUY | price_SELL | Volatility_Buy | Volatility_sell | Sharpe_Ratio | expected_return_yearly | inflation | nominal_return | ESG_ranking | PE_ratio | EPS_ratio | PS_ratio | PB_ratio | NetProfitMargin_ratio | current_ratio | roa_ratio | roe_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| horizon_days | 1.000000 | 0.057634 | -0.019103 | 0.041087 | -0.000782 | 0.079032 | -0.000782 | 0.557271 | 0.000982 | 0.314202 | -0.056352 | -0.030855 | 0.043956 | 0.058744 | 0.000010 | 0.049919 | 0.063964 | 0.059054 | 0.045002 |
| amount | 0.057634 | 1.000000 | 0.015068 | 0.019191 | 0.036155 | -0.017119 | 0.036155 | 0.022659 | 0.024025 | 0.035272 | 0.041535 | -0.025461 | -0.005007 | 0.003251 | 0.026403 | -0.072575 | -0.033395 | -0.089766 | -0.098832 |
| price_BUY | -0.019103 | 0.015068 | 1.000000 | 0.975333 | -0.074075 | -0.054223 | -0.074075 | 0.105182 | 0.091847 | 0.046071 | 0.060159 | 0.317998 | 0.679935 | 0.219801 | 0.367833 | 0.112021 | 0.161695 | 0.158972 | 0.004417 |
| price_SELL | 0.041087 | 0.019191 | 0.975333 | 1.000000 | -0.071000 | -0.062839 | -0.071000 | 0.165045 | 0.083859 | 0.142205 | 0.058317 | 0.304729 | 0.698000 | 0.222112 | 0.350046 | 0.122658 | 0.172211 | 0.161950 | 0.002011 |
| Volatility_Buy | -0.000782 | 0.036155 | -0.074075 | -0.071000 | 1.000000 | 0.850632 | 1.000000 | 0.166545 | 0.050158 | 0.065940 | -0.079210 | 0.033027 | -0.225247 | 0.237593 | 0.230903 | -0.156604 | 0.194057 | -0.054318 | -0.265809 |
| Volatility_sell | 0.079032 | -0.017119 | -0.054223 | -0.062839 | 0.850632 | 1.000000 | 0.850632 | 0.171310 | 0.070344 | -0.039453 | -0.123275 | 0.035551 | -0.211890 | 0.187969 | 0.176131 | -0.175065 | 0.175390 | -0.040506 | -0.258205 |
| Sharpe_Ratio | -0.000782 | 0.036155 | -0.074075 | -0.071000 | 1.000000 | 0.850632 | 1.000000 | 0.166545 | 0.050158 | 0.065940 | -0.079210 | 0.033027 | -0.225247 | 0.237593 | 0.230903 | -0.156604 | 0.194057 | -0.054318 | -0.265809 |
| expected_return_yearly | 0.557271 | 0.022659 | 0.105182 | 0.165045 | 0.166545 | 0.171310 | 0.166545 | 1.000000 | 0.022885 | 0.533747 | -0.088662 | 0.085136 | 0.038070 | 0.391668 | 0.361000 | 0.199539 | 0.263973 | 0.243156 | -0.021053 |
| inflation | 0.000982 | 0.024025 | 0.091847 | 0.083859 | 0.050158 | 0.070344 | 0.050158 | 0.022885 | 1.000000 | 0.003025 | -0.004853 | -0.007222 | -0.008672 | 0.048510 | 0.116725 | -0.034308 | -0.009438 | 0.041586 | 0.038014 |
| nominal_return | 0.314202 | 0.035272 | 0.046071 | 0.142205 | 0.065940 | -0.039453 | 0.065940 | 0.533747 | 0.003025 | 1.000000 | -0.105205 | 0.025856 | 0.072892 | 0.118291 | 0.038219 | 0.152078 | 0.101896 | 0.134162 | 0.049182 |
| ESG_ranking | -0.056352 | 0.041535 | 0.060159 | 0.058317 | -0.079210 | -0.123275 | -0.079210 | -0.088662 | -0.004853 | -0.105205 | 1.000000 | 0.112499 | -0.106382 | 0.164561 | 0.012402 | -0.138176 | 0.118752 | -0.429422 | -0.469440 |
| PE_ratio | -0.030855 | -0.025461 | 0.317998 | 0.304729 | 0.033027 | 0.035551 | 0.033027 | 0.085136 | -0.007222 | 0.025856 | 0.112499 | 1.000000 | -0.042664 | 0.117468 | 0.352464 | -0.036344 | 0.054029 | -0.029921 | -0.064672 |
| EPS_ratio | 0.043956 | -0.005007 | 0.679935 | 0.698000 | -0.225247 | -0.211890 | -0.225247 | 0.038070 | -0.008672 | 0.072892 | -0.106382 | -0.042664 | 1.000000 | 0.078800 | -0.045096 | 0.329119 | 0.130749 | 0.311497 | 0.243385 |
| PS_ratio | 0.058744 | 0.003251 | 0.219801 | 0.222112 | 0.237593 | 0.187969 | 0.237593 | 0.391668 | 0.048510 | 0.118291 | 0.164561 | 0.117468 | 0.078800 | 1.000000 | 0.483868 | 0.522420 | 0.822391 | 0.446526 | -0.029877 |
| PB_ratio | 0.000010 | 0.026403 | 0.367833 | 0.350046 | 0.230903 | 0.176131 | 0.230903 | 0.361000 | 0.116725 | 0.038219 | 0.012402 | 0.352464 | -0.045096 | 0.483868 | 1.000000 | -0.043134 | 0.182139 | 0.117530 | -0.176910 |
| NetProfitMargin_ratio | 0.049919 | -0.072575 | 0.112021 | 0.122658 | -0.156604 | -0.175065 | -0.156604 | 0.199539 | -0.034308 | 0.152078 | -0.138176 | -0.036344 | 0.329119 | 0.522420 | -0.043134 | 1.000000 | 0.474356 | 0.733180 | 0.560262 |
| current_ratio | 0.063964 | -0.033395 | 0.161695 | 0.172211 | 0.194057 | 0.175390 | 0.194057 | 0.263973 | -0.009438 | 0.101896 | 0.118752 | 0.054029 | 0.130749 | 0.822391 | 0.182139 | 0.474356 | 1.000000 | 0.452459 | 0.111891 |
| roa_ratio | 0.059054 | -0.089766 | 0.158972 | 0.161950 | -0.054318 | -0.040506 | -0.054318 | 0.243156 | 0.041586 | 0.134162 | -0.429422 | -0.029921 | 0.311497 | 0.446526 | 0.117530 | 0.733180 | 0.452459 | 1.000000 | 0.710006 |
| roe_ratio | 0.045002 | -0.098832 | 0.004417 | 0.002011 | -0.265809 | -0.258205 | -0.265809 | -0.021053 | 0.038014 | 0.049182 | -0.469440 | -0.064672 | 0.243385 | -0.029877 | -0.176910 | 0.560262 | 0.111891 | 0.710006 | 1.000000 |
heatmap = go.Heatmap(
    z=matriz_correlacion.values,   # correlation values
    x=matriz_correlacion.columns,  # column names
    y=matriz_correlacion.columns,  # row names
    colorscale='Viridis',          # color palette
)
layout = go.Layout(title='Correlation heatmap')
fig = go.Figure(data=[heatmap], layout=layout)
# Show the heatmap
fig.show()
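Rather than scanning the heatmap by eye, the strongly related pairs can be extracted programmatically. A small sketch (the function name and 0.8 threshold are our own choices), applicable to the `matriz_correlacion` computed above:

```python
import pandas as pd

def highly_correlated_pairs(corr, threshold=0.8):
    """Return (col_a, col_b, r) tuples with |r| >= threshold, upper triangle only."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # strongest relationships first
    return sorted(pairs, key=lambda t: -abs(t[2]))

# Usage: highly_correlated_pairs(matriz_correlacion, threshold=0.8)
```

Pairs flagged this way are candidates for dropping one of the two columns before model training, since near-duplicate predictors add no information and can destabilize coefficient estimates.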
Exploration: Logistic regression model#
As we have seen, most variables do not appear to help explain the response. So, as an exploratory step, we fit a logistic regression model to learn a bit more about the data we are working with.
data_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 horizon_days 900 non-null int64
1 amount 900 non-null float64
2 price_BUY 900 non-null float64
3 price_SELL 900 non-null float64
4 Volatility_Buy 900 non-null float64
5 Volatility_sell 900 non-null float64
6 Sharpe_Ratio 900 non-null float64
7 expected_return_yearly 900 non-null float64
8 inflation 900 non-null float64
9 nominal_return 900 non-null float64
10 investment 900 non-null object
11 ESG_ranking 900 non-null float64
12 PE_ratio 900 non-null float64
13 EPS_ratio 900 non-null float64
14 PS_ratio 900 non-null float64
15 PB_ratio 900 non-null float64
16 NetProfitMargin_ratio 900 non-null float64
17 current_ratio 900 non-null float64
18 roa_ratio 900 non-null float64
19 roe_ratio 900 non-null float64
dtypes: float64(18), int64(1), object(1)
memory usage: 140.8+ KB
# Encode the response as 0/1 (BAD = 0, GOOD = 1) so statsmodels can fit the logit
data_1['investment'] = LabelEncoder().fit_transform(data_1['investment'])
modelo_logit = logit('investment ~ horizon_days + amount + price_BUY + price_SELL + Volatility_Buy + Volatility_sell + Sharpe_Ratio + expected_return_yearly + inflation + nominal_return + ESG_ranking + PE_ratio + EPS_ratio + PS_ratio + PB_ratio + NetProfitMargin_ratio + current_ratio + roa_ratio + roe_ratio', data=data_1).fit()
Let's check which variables are significant in the model fitted above:
p_valores = modelo_logit.pvalues
variables_significativas = p_valores[p_valores < 0.05]
variables_significativas
price_BUY 2.475171e-03
price_SELL 6.392773e-04
Volatility_sell 1.408458e-03
expected_return_yearly 4.940344e-02
inflation 7.287026e-21
nominal_return 1.043301e-20
dtype: float64
# Refit using only the significant predictors
modelo_logit = sm.Logit(data_1['investment'], data_1[['Volatility_sell', 'expected_return_yearly', 'inflation', 'nominal_return', 'price_SELL', 'price_BUY']])
resultado_modelo = modelo_logit.fit()
Optimization terminated successfully.
Current function value: 0.211165
Iterations 10
p_valores = resultado_modelo.pvalues
variables_significativas = p_valores[p_valores < 0.05]
variables_significativas
Volatility_sell 2.623637e-13
expected_return_yearly 1.624640e-02
inflation 1.012144e-22
nominal_return 8.670997e-23
price_SELL 1.149075e-04
price_BUY 1.378763e-04
dtype: float64
resultado_modelo.summary()
| Dep. Variable: | investment | No. Observations: | 900 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 894 |
| Method: | MLE | Df Model: | 5 |
| Date: | Sun, 21 Apr 2024 | Pseudo R-squ.: | 0.6825 |
| Time: | 16:12:04 | Log-Likelihood: | -190.05 |
| converged: | True | LL-Null: | -598.63 |
| Covariance Type: | nonrobust | LLR p-value: | 2.246e-174 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Volatility_sell | -6.5713 | 0.899 | -7.312 | 0.000 | -8.333 | -4.810 |
| expected_return_yearly | -4.4793 | 1.864 | -2.403 | 0.016 | -8.132 | -0.826 |
| inflation | -7.1526 | 0.729 | -9.811 | 0.000 | -8.581 | -5.724 |
| nominal_return | 24.9465 | 2.539 | 9.826 | 0.000 | 19.971 | 29.922 |
| price_SELL | -0.0258 | 0.007 | -3.857 | 0.000 | -0.039 | -0.013 |
| price_BUY | 0.0279 | 0.007 | 3.812 | 0.000 | 0.014 | 0.042 |
Possibly complete quasi-separation: A fraction 0.38 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
With this exploratory method we can confirm once more that “inflation” and “nominal_return” contribute the most, followed by Volatility_sell, expected_return_yearly, price_SELL and price_BUY.
As a conclusion of the exploratory data analysis, most of the variables in the data frame show no strong association with the categorical response variable. As could be seen in the box plots and in the classification tree used to study the variables, only a few of them appear decisive in determining whether an investment will be good or bad. Additionally, a logistic regression model built for didactic, exploratory purposes showed that most of the variables are not significant at the predictive level.